Simplify approach for creating dag run and task spans#62554
Simplify approach for creating dag run and task spans#62554dstandish wants to merge 39 commits intoapache:mainfrom
Conversation
shared/observability/src/airflow_shared/observability/traces/otel_tracer.py
Outdated
Show resolved
Hide resolved
74b3942 to
236e5ee
Compare
a1cb165 to
a85b5eb
Compare
There was a problem hiding this comment.
@dstandish Great job! I left some comments but generally it looks good and I think it can be converted to an actual PR.
Additionally, I tested it manually and I ran a dag with long timeouts. I didn't see any issues.
a7766b3 to
7e0a97e
Compare
integration tests fix static check test fix? try fix int test fix test fix tests fix tests fix integration test fix tests pre-commit
This reverts commit f453689.
d90deb0 to
b4166e7
Compare
| type: string | ||
| example: ~ | ||
| default: "False" | ||
| otel_task_runner_span_flush_timeout_millis: |
There was a problem hiding this comment.
In another config we use the full word “milliseconds” so this should match. That’s the only place we use milliseconds though; elsewhere we always use seconds instead.
There was a problem hiding this comment.
millis vs msec no opinion on (Other than the "go duration" mode of 1s/300ms would be nice here but out of scope for this PR).
As for seconds vs milliseconds here: It appears the millis is only an artifact of the Python API, and this isn't commonly set as an environment variable in OTEL installs, so we can choose what makes most sense for us, so we could have this as float number of seconds and convert to the format otel wants.
WDYT @uranusjr ?
| type: string | ||
| example: ~ | ||
| default: "False" | ||
| otel_task_runner_span_flush_timeout_millis: |
There was a problem hiding this comment.
It's too late to change this now, but the otel_ prefix on all of this is needless :(
| Timeout in milliseconds to wait for the OpenTelemetry span exporter to flush pending spans | ||
| when a task runner process exits. If the exporter does not finish within this time, any | ||
| buffered spans may be dropped. | ||
| version_added: 3.1.0 |
There was a problem hiding this comment.
| version_added: 3.1.0 | |
| version_added: 3.2.0 |
| status_code = StatusCode.OK if state == DagRunState.SUCCESS else StatusCode.ERROR | ||
| span.set_status(status_code) | ||
| span.end() |
There was a problem hiding this comment.
@nickstenning @xBis7 Question on the otel/tracing side. What happens if we never emit this DagRun span, or don't emit it for for days?
It's possible the dag run is left in the running state-- one case when this can happen is that the dag run is started and then the user goes and pauses the Dag in the UI/API, which will mean the state of the dag run is never finiallized.
There was a problem hiding this comment.
(I don't think we can do much about this, just might need to document it)
There was a problem hiding this comment.
If I understand correctly, we will only see the task spans. The dag run will need to finish so that the span is created and exported. If that happens, then the dag run span will be the parent of the task spans but until then we will only see the children.
| TraceContextTextMapPropagator().extract(msg.ti.context_carrier) if msg.ti.context_carrier else None | ||
| ) | ||
| ti = msg.ti | ||
| span_name = f"task_run.{ti.task_id}" |
| from airflow.sdk.configuration import conf | ||
|
|
||
| timeout_millis = conf.getint( | ||
| "traces", "otel_task_runner_span_flush_timeout_millis", fallback=30000 |
There was a problem hiding this comment.
Does this config need to be specific to task_runner, or could it apply to all places where we flush before exit?
ashb
left a comment
There was a problem hiding this comment.
A couple of small questions, but looks good to me.
|
@dstandish @ashb Please take a look at this #62554 (comment) I can't unresolve the conversation and the comments are hidden. Apart from that, it looks good to me. |
Replace otel spans tracking implementation with something much simpler
rip out Trace, _TraceMeta, DebugTrace "classes"--> laterremove the "Tracer" protocol and EmptyTrace--> laterdeprecate the OtelTrace class